Premas Life Sciences
5/2/2026
Introduction to 16S rRNA Metagenomic Data Analysis
Analysis Workflow
Basics of WSL and R
After taking this workshop, participants should be able to:
Basic idea about working in linux.
Able to configure and install tools for setting up any analysis environment.
Handle raw sequencing files, generate fastq data and perform basic exploratory and statistical analysis.
Who are they?
(species identification; what genes they contain)
Where do they come from?
Are there similarities:
What are they doing?
How are they doing?
(factors influencing the community)
bcl2fastq or bclconvert (Linux OS or WSL)
R and R studio
A shell is a computer program that presents a command line interface which allows you to control your computer using commands entered with a keyboard .
Command Prompt: Not useful for bioinformatics
PowerShell : Can run Conda/CLI tools
WSL
Windows introduced a Linux compatibility layer which allows running Linux inside Windows.
Install from Windows Store
Terminal: A program that runs a shell
Shell: A program that interprets commands
Bash: The most common shell on Linux
zsh: The most common shell on MacOS
PowerShell: The most common shell on Windows
Cmd: The legacy shell on Windows
Fast and efficient way to interact with your computer
Important part of your automation toolbox to create a reproducible data analysis pipeline
Accessing a remote server almost always requires some sort of command line skills
| Description | Win | Linux, MacOS |
|---|---|---|
| Copy files, folders | copy |
cp |
| Move files, folders | move |
mv |
| List folder content | dir |
ls |
| Create new folder | mkdir |
mkdir |
| Change current folder | cd |
cd |
| Show current path. | cd |
pwd |
| Locate a software | where |
which |
| Delete file(s) | del |
rm |
pwd
ls
ls -lh
cd 16s_workshop
cd ..
cd ~
mkdir data
mkdir images
mkdir results
cp sample_R1.fastq.gz data/
mv sample_R1.fastq.gz sample1_R1.fastq.gz
head metadata.tsv
grep -n “control” metadata.tsv
ls -lh data/*.fastq.gz
zcat data/sample1_R1.fastq.gz | head -n 12
zcat data/sample1_R1.fastq.gz | wc -l
touch results/notes.txt
| Suffix | Compress Command | Extract Command | Useful Arguments | What it does |
|---|---|---|---|---|
.zip |
zip |
unzip |
-d, -c, -f |
Windows-style archive, keeps multiple files together |
.gz |
gzip |
gunzip |
-d, -c, -f |
Compresses a single file |
.gz |
– | zcat |
– | View .gz file without extracting |
.tar.gz |
tar -czf |
tar -xzf |
-x, -v, -f, -z |
Linux archive (multiple files + compression) |
.bz2 |
bzip2 |
bunzip2 |
-d, -k, -f |
Like gzip but higher compression |
Relative vs. Absolute Paths
A relative path specifies a location relative to the current directory which is a “fixed location” on your computer
Often, this “fixed location” is the so-called “working directory”
The dot . denotes the current working directory
The dot dot .. denotes the parent directory, i.e., it points upwards in the folder hierarchy
Finally, the tilde symbol ~ will bring you back to your home directory, e.g. cd ~
subfolder/file.txt # Inside a subfolder
./file.txt # Current directory
../file.txt # Parent directory
cd C:\Users\Username\Documents
cd /Users/Username/Documents
cd Users
cd Username
cd Documents
BCL output folder or RUN folder
Samplesheet
bcl2fastq or bclconvert
Software required
fastqc
multiqc
Q = -10 x log10(P), where P is the probability that a base call is erroneous
| Phred Quality Score | Probability of incorrect base call | Base call accuracy |
|---|---|---|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
fastqc -h
fastqc -o results/ *.fastq.gz
Loading R Session…
CRAN : http://cran.r-project.org/
Bioconductor : https://bioconductor.org/
Saving your R environment (.Rdata)
Navigating directories
Set working directories
To understand some of the most basic features of the R language including:
Creating R objects and understanding object types
Using mathematical operations
Using comparison operators
Creating, subsetting, and modifying vectors
#Delete the object
| Operator | Description |
|---|---|
| + | addition |
| - | subtraction |
| * | multiplication |
| / | division |
| ^ or ** | exponentiation |
| a%/%b | integer division (division where the remainder is discarded) |
| a%%b | modulus (returns the remainder after division) |
create an object storing your own age and do some mathmatical operation and print it
| Operator | Description |
|---|---|
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | exactly equal to |
| != | not equal to |
| !x | not x |
| a | b | a or b |
| a & b | a and b |
factors
lists
data frames
matrices
List of 3
$ : chr [1:2] "Corynebacterium jeikeium" "Acinetobacter calcoaceticus"
$ : chr [1:8] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" ...
$ : num [1:4] 100 200 300 400
R import functions. These include read.csv(), read.table(), read.delim()
Load the data and filter based on thresholds,counts,metadata
Functions to test
[1] 63677
[1] 8
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003 679 448 873 408 1138
ENSG00000000005 0 0 0 0 0
ENSG00000000419 467 515 621 365 587
ENSG00000000457 260 211 263 164 245
ENSG00000000460 60 55 40 35 78
ENSG00000000938 0 0 2 0 1
SRR1039517 SRR1039520 SRR1039521
ENSG00000000003 1047 770 572
ENSG00000000005 0 0 0
ENSG00000000419 799 417 508
ENSG00000000457 331 233 229
ENSG00000000460 63 76 60
ENSG00000000938 0 0 0
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000273488 7 5 8 3 8
ENSG00000273489 0 0 0 1 0
ENSG00000273490 0 0 0 0 0
ENSG00000273491 0 0 0 0 0
ENSG00000273492 0 0 1 0 0
ENSG00000273493 0 0 0 0 1
SRR1039517 SRR1039520 SRR1039521
ENSG00000273488 16 11 14
ENSG00000273489 1 0 0
ENSG00000273490 0 0 0
ENSG00000273491 0 0 0
ENSG00000273492 0 0 0
ENSG00000273493 0 0 0
[1] "SRR1039508" "SRR1039509" "SRR1039512" "SRR1039513" "SRR1039516"
[6] "SRR1039517" "SRR1039520" "SRR1039521"
counts <- read.table("data/Data_for_R_analysis/airway_counts.txt", header = TRUE, sep = "\t", row.names = 1, check.names = FALSE)
meta <- read.table("data/Data_for_R_analysis/airway_metadata.txt", header = TRUE, sep = "\t", check.names = FALSE)
stopifnot(all(colnames(counts) %in% meta$Run))
counts <- counts[, meta$Run] # reorder columns to match metadata
# transform
log_counts <- log2(counts + 1)
libsize <- colSums(counts)
cpm <- t( t(counts) / libsize * 1e6 )
log_cpm <- log2(cpm + 1)
dim(log_cpm)[1] 63677 8
trt untrt
4 4
[1] 63677 8
[1] 16090 8
# install.packages(c("ggplot2", "pheatmap", "corrplot"))
library(ggplot2)
library(pheatmap)
library(corrplot)
df <- data.frame(
value = as.numeric(Xf),
sample = rep(colnames(Xf), each = nrow(Xf))
)
ggplot(df, aes(x = sample, y = value)) +
geom_boxplot(outlier.shape = NA) +
theme_minimal() +
labs(
title = "Log-CPM per sample",
x = "Sample",
y = "log2(CPM + 1)"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))library(ggplot2)
df <- data.frame(
value = as.numeric(Xf),
sample = rep(colnames(Xf), each = nrow(Xf))
)
df <- merge(df, meta_sub, by.x = "sample", by.y = "Run")
ggplot(df, aes(x = cell, y = value, fill = cell)) +
geom_boxplot(outlier.shape = NA) +
theme_minimal() +
labs(
title = "Log-CPM by cell type",
x = "Cell type",
y = "log2(CPM + 1)"
)vars <- apply(Xf, 1, var)
top <- order(vars, decreasing = TRUE)[1:500] # top 500 variable genes
X_top <- X[top, ]
pca <- prcomp(t(X_top), center = TRUE, scale. = FALSE)
pca_df <- data.frame(
Sample = rownames(pca$x),
PC1 = pca$x[, 1],
PC2 = pca$x[, 2]
)
#pca_df <- merge(pca_df, meta_sub, by.x = "Sample", by.y = "sample")
pca_df <- merge(pca_df, meta_sub, by.x = "Sample", by.y = "Run")
ggplot(pca_df, aes(PC1, PC2, color = dex)) +
geom_point(size = 4) +
theme_minimal() +
labs(title = "PCA (top variable genes)", x = "PC1", y = "PC2")sds <- apply(Xf, 1, sd)
# Keep only genes with variation
X_var <- X[sds > 0, , drop = FALSE]
# Pick top 50 most variable
top50 <- order(apply(X_var, 1, sd), decreasing = TRUE)[1:50]
X50 <- X_var[top50, , drop = FALSE]
library(pheatmap)
ann <- meta_sub[, c("cell", "dex"), drop = FALSE]
rownames(ann) <- meta_sub$Run
pheatmap(X50,
scale = "row",
annotation_col = ann,
show_rownames = TRUE,
show_colnames = TRUE,
fontsize_row = 8,
fontsize_col = 10,
main = "Top 50 variable genes")▶ Heterogeneity.
▶ Poor data quality.
▶ Structured high-dimensionality.
▶ Reproducibility.
| Aspect | OTUs (Operational Taxonomic Units) | ASVs (Amplicon Sequence Variants) |
|---|---|---|
| Definition | OTUs are clusters of sequences grouped by similarity, usually at a 97% identity threshold. | ASVs are exact DNA sequences inferred from the data after removing sequencing errors. |
| Basic idea | Similar sequences are grouped into the same unit. | Every unique biological sequence is kept. |
| Step | OTU (Clustering approach) | ASV (Error-model approach) |
|---|---|---|
| 1 | Start with all sequencing reads | Start with all sequencing reads |
| 2 | Compare sequences to each other | Learn the sequencing error pattern |
| 3 | Measure similarity between sequences | Separate real variation from noise |
| 4 | Group sequences above a similarity threshold (e.g. 97%) | Infer true biological sequences |
| 5 | Merge similar sequences into one OTU | Keep exact sequences (even 1 base difference) |
| 6 | Assign taxonomy to each OTU | Assign taxonomy to each ASV |
| 7 | Build OTU table (counts per sample) | Build ASV table (counts per sample) |
| Workflow type | Tool | Description | Typical usage |
|---|---|---|---|
| OTU-based | Mothur | A well-established pipeline for OTU clustering and classical microbiome analysis. | Legacy studies, teaching traditional pipelines |
| VSEARCH | Open-source tool for similarity-based clustering and chimera detection. | Fast OTU clustering, open-source replacement for USEARCH | |
| USEARCH | Classic OTU software (not fully open-source, but widely cited). | Historical reference, many old pipelines | |
| ASV-based | DADA2 (R) | Most popular ASV inference tool; builds error models and integrates with phyloseq. | Gold standard in R-based workflows |
| Deblur (QIIME2) | Fast and robust ASV inference using static error profiles. | Large datasets, quick processing | |
| QIIME 2 | Modern microbiome platform supporting both OTUs (legacy) and ASVs (via DADA2/Deblur). | End-to-end analysis, most widely used platform today |
| Command | What we’re doing | |
|---|---|---|
| 1 | cutadapt/filterAndTrim() |
remove primers and quality trim/filter |
| 2 | learnErrors() |
generate an error model of our data |
| 3 | derepFastq |
dereplicate sequences |
| 4 | dada() |
infer ASVs on both forward and reverse reads independently |
| 5 | mergePairs() |
merge forward and reverse reads to further refine ASVs |
| 6 | makeSequenceTable() |
generate a count table |
| 7 | removeBimeraDenovo() |
screen for and remove chimeras |
| 8 | IdTaxa() |
assign taxonomy |
a fasta file of ASVs
count table
taxonomy table.
set.seed(1)
# if (!requireNamespace("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
# BiocManager::install("phyloseq")
# install.packages(c("ggplot2", "vegan"))
library(phyloseq)
library(ggplot2)
library(vegan)
n_taxa <- 50
n_samples <- 12
taxa <- paste0("Taxon_bact", 1:n_taxa)
# Fake sample names
samples <- paste0("Sample_", 1:n_samples)
# Simulated count data (like sequencing reads)
counts <- matrix(
rpois(n_taxa * n_samples, lambda = 100),
nrow = n_taxa,
ncol = n_samples,
dimnames = list(taxa, samples)
)
head(counts)
metadata <- data.frame(
Sample = samples,
BodySite = rep(c("Gut", "Skin", "Oral"), each = 4),
DiseaseStatus = rep(c("Healthy", "Disease"), times = 6),
Treatment = rep(c("Control", "Drug"), times = 6)
)
metadata
OTU <- otu_table(as.matrix(counts), taxa_are_rows = TRUE)
META <- sample_data(metadata)
library(phyloseq)
rownames(metadata) <- metadata$Sample
OTU <- otu_table(as.matrix(counts), taxa_are_rows = TRUE)
META <- sample_data(metadata)
stopifnot(identical(sample_names(OTU), sample_names(META)))
ps <- phyloseq(OTU, META)
tax <- data.frame(
Kingdom = rep("Bacteria", nrow(counts)),
Phylum = sample(c("Firmicutes", "Bacteroidetes", "Proteobacteria"),
nrow(counts), replace = TRUE),
row.names = rownames(counts)
)
ps <- merge_phyloseq(ps, tax_table(as.matrix(tax)))
ps
ps
sample_sums(ps) # Total reads per sample
taxa_sums(ps) # Total reads per taxon
#ps_rare <- rarefy_even_depth(ps, rngseed = 123)
#rarecurve(t(otu_table(ps_rare)), step = 50, cex = 0.5)
alpha_div <- estimate_richness(ps, measures = c("Observed", "Shannon", "Simpson"))
alpha_meta <- cbind(alpha_div, sample_data(ps))
library(ggpubr)
p1 <- ggplot(alpha_meta, aes(x = BodySite, y = Shannon, fill = BodySite)) +
geom_boxplot() +
geom_jitter(width = 0.2) +
theme_bw() +
labs(title = "Shannon Diversity by Body Site")
p2 <- ggplot(alpha_meta, aes(x = DiseaseStatus, y = Shannon, fill = DiseaseStatus)) +
geom_boxplot() +
geom_jitter(width = 0.2) +
theme_bw() +
labs(title = "Shannon Diversity by Disease Status")
ggarrange(p1, p2, ncol = 2)
# Taxonomic Composition
# Transform to relative abundance
ps_rel <- transform_sample_counts(ps, function(x) x / sum(x) * 100)
# Bar plot at Phylum level (top 5 phyla)
top_phyla <- names(sort(taxa_sums(ps_rel), decreasing = TRUE))[1:5]
ps_top <- prune_taxa(top_phyla, ps_rel)
plot_bar(ps_top, x = "Sample", fill = "Phylum") +
facet_grid(~ BodySite, scales = "free_x", space = "free") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(y = "Relative Abundance (%)",
title = "Top 5 Phyla by Body Site")
# Stacked bar plot by group
plot_bar(ps_rel, x = "BodySite", fill = "Phylum") +
geom_bar(stat = "identity", position = "fill") +
theme_bw() +
labs(y = "Proportion", title = "Phylum Composition by Body Site") +
facet_wrap(~ DiseaseStatus)
ps_hel <- transform_sample_counts(ps, function(x) sqrt(x / sum(x)))
# Bray-Curtis
ord_bray <- ordinate(ps_hel, method = "PCoA", distance = "bray")
p3 <- plot_ordination(ps_hel, ord_bray, color = "BodySite", shape = "DiseaseStatus") +
geom_point(size = 4) +
theme_bw() +
stat_ellipse(aes(group = BodySite)) +
labs(title = "PCoA - Bray-Curtis")
# NMDS plot
ord_nmds <- ordinate(ps_hel, method = "NMDS", distance = "bray")
p4 <- plot_ordination(ps_hel, ord_nmds, color = "BodySite", shape = "Treatment") +
geom_point(size = 4) +
theme_bw() +
labs(title = "NMDS - Bray-Curtis")
library(patchwork)
p3 + p4
## clustering
ps_rel <- transform_sample_counts(ps, function(x) x / sum(x))
top_taxa <- names(sort(taxa_sums(ps_rel), decreasing = TRUE))[1:20]
ps_top <- prune_taxa(top_taxa, ps_rel)
mat <- as.matrix(otu_table(ps_top))
if (!taxa_are_rows(ps_top)) mat <- t(mat)
sample_annot <- data.frame(
DiseaseStatus = sample_data(ps_top)$DiseaseStatus,
BodySite = sample_data(ps_top)$BodySite
)
rownames(sample_annot) <- sample_names(ps_top)
tax_annot <- data.frame(
Phylum = tax_table(ps_top)[, "Phylum"]
)
rownames(tax_annot) <- taxa_names(ps_top)
library(pheatmap)
pheatmap(
mat,
scale = "row", # highlight patterns per taxon
annotation_col = sample_annot, # sample metadata
annotation_row = tax_annot, # taxonomic info
show_rownames = TRUE,
show_colnames = TRUE,
fontsize_row = 8,
fontsize_col = 9,
main = "Heatmap of top taxa (relative abundance)"
)Set up a Linux environment on Windows using WSL
Create an isolated software environment using conda
Install cutadapt in the conda environment
Run a published 16S rRNA analysis using DADA2
Understand how ASVs are generated from raw sequencing data